Abstract

This paper describes the second participation of BJUT in the
Temporal Summarization track. We performed the experiments on
the TREC KBA 2014 stream corpus using classic information
retrieval models, such as BM25 and the vector space model. We
also introduce the details of our system, which consists of a
corpus pre-processing module, an information retrieval module,
and an information process module.

Introduction 

The TREC Temporal Summarization Track runs for the second time
this year. Unlike in 2013 [1], this year's track focuses on only
one task: Sequential Update Summarization. For each topic query,
participants should return a set of relevant and novel sentences,
ordered by time, from a time-ordered stream of documents, so that
users can efficiently monitor the information associated with an
event (such as a natural disaster) over time [2]. Another primary
difference lies in the data size, which was reduced from 4.5 TB to
559 GB. The corpus, named TREC-TS-2014F, is a filtered version of
the full corpus. It is intended for participants of the TREC-TS
track, so that groups can take part without having to process the
full corpus. The corpus consists of a set of time-stamped
documents from a variety of news and social media sources covering
the period from October 2011 to April 2013. A document contains a
set of sentences, each with a unique identifier.

SYSTEM 

According to the task definition of the Temporal Summarization
Track, a system should emit sentences that are relevant,
important, and novel with respect to a specific event. We
submitted three runs for this task; the implementation framework
of our system is shown in Fig. 1.

As shown in Fig. 1, our temporal summarization system mainly
consists of a corpus pre-processing module, an information
retrieval module, and an information process module.

•  Information  pre-processing  module 

The corpus downloaded from streamcorpus-2014-v0_3_0-ts-filtered [3]
consists of encrypted files, which cannot be used directly.
Therefore, we first decrypt the files with the authorized key,
converting them from the .GPG format to the .SC format. Second, we
parse the .SC files into .TXT files using the streamcorpus
toolbox. The streamcorpus toolbox is provided by TREC and offers a
common data interchange format for document processing pipelines
that apply language-processing tools to large streams of text.

•  Information  retrieval  module 

First, we build an index for the .TXT files. Then, we run
retrieval combined with query expansion to obtain the relevant
sentences.

•  Information  process  module 

Text similarity and clustering can improve the accuracy and
recall of the retrieval results. We used two methods to complete
this part. After topic clustering, the centers of the different
clusters are chosen to build the summaries. The summaries are then
ranked by a time factor and a similarity factor.

The solid-line frames in Fig. 1 mark the main methods we used for
the temporal summarization track. The details of our work are
introduced in the following sections, with emphasis on the two key
parts: the information retrieval module and the information
process module.

Information  Retrieval  Module 

•  Information  Retrieval 

In this part, we use Lemur [4] for indexing and retrieval. Lemur
is a toolkit designed to facilitate research in language modeling
and information retrieval (IR). It supports the construction of
basic text retrieval systems using language modeling methods as
well as classic retrieval models such as BM25 [5]. Building the
index takes two steps. First, we create a parameter file that
tells the Lemur toolkit how to index. Second, we run the
IndriBuildIndex.exe application to build the index. Retrieval
likewise takes two steps. First, we create a parameter file that
tells the Lemur toolkit how to retrieve. Second, we run the
IndriRunQuery.exe application to retrieve.

Figure 1: The framework of the temporal summarization track.

• Query expansion

When retrieving information with query words, one problem always
arises: word mismatch, meaning that queries and documents often
use different words to describe the same concept. To address this
problem, we use query expansion. Query expansion augments a query
word so that it matches more sentences, thus potentially
increasing the number of relevant results. The query word is
extended with words of similar meaning, which increases the chance
of matching words in the relevant documents.
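The expansion step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the synonym table, the example query, and the helper names `expand_query` and `build_query_params` are all hypothetical, and the paper does not state where its expansion terms come from. The XML layout (`index`, `count`, `query/number`, `query/text`) follows the usual IndriRunQuery parameter-file shape and should be checked against the Indri documentation for the version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table; the source of expansion terms is an assumption.
SYNONYMS = {
    "earthquake": ["quake", "tremor"],
    "storm": ["hurricane", "cyclone"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append related words for each query term, keeping order, dropping duplicates."""
    expanded = []
    for term in query.lower().split():
        for word in [term] + synonyms.get(term, []):
            if word not in expanded:
                expanded.append(word)
    return expanded


def build_query_params(index_path, query_id, terms, count=1000):
    """Return an IndriRunQuery-style parameter file as an XML string."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "count").text = str(count)
    query = ET.SubElement(root, "query")
    ET.SubElement(query, "number").text = str(query_id)
    ET.SubElement(query, "text").text = " ".join(terms)
    return ET.tostring(root, encoding="unicode")


terms = expand_query("Guatemala earthquake")
params_xml = build_query_params("/path/to/index", 1, terms)
print(terms)
print(params_xml)
```

The expanded term list is passed as a plain bag of words; weighting original terms above expansion terms would be a natural refinement, but the paper does not say whether this was done.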

Information  Process  Module 

After the information retrieval module, we have a set of sentences
related to a topic. Because the number of these sentences is
large, we simplify the computation by first checking whether each
sentence falls between the topic's begin time and end time.
Sentences outside this time period are discarded. The remaining
sentences are treated as candidate sentences and undergo further
processing, such as text similarity calculation and text
clustering.
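The time-window check above can be sketched in a few lines. The `Sentence` record and its fields are assumptions made for illustration (timestamps as epoch seconds, an inclusive window); the paper does not describe its internal data structures.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    sentence_id: str  # unique identifier of the sentence
    timestamp: int    # document time as epoch seconds (assumed representation)
    text: str


def filter_by_time(sentences, start, end):
    """Keep only sentences whose timestamp lies in the inclusive range [start, end]."""
    return [s for s in sentences if start <= s.timestamp <= end]


retrieved = [
    Sentence("doc1-0", 100, "reported before the topic window"),
    Sentence("doc2-3", 250, "reported during the topic window"),
    Sentence("doc3-1", 900, "reported after the topic window"),
]
candidates = filter_by_time(retrieved, start=200, end=800)
print([s.sentence_id for s in candidates])  # ['doc2-3']
```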

•  Text  Similarity 

We used two methods. The first is based on the Vector Space Model
(VSM) [6], which represents a document by the feature terms
extracted from it and their weights. The VSM is an algebraic model
for representing text documents (and objects in general) as
vectors of identifiers, such as index terms. It is used in
information filtering, information retrieval, indexing, and
relevancy ranking. In the VSM, sentences and queries are
represented as vectors:

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}) \tag{1}$$

Each dimension corresponds to a separate term. If a term occurs in
the sentence, its value in the vector is non-zero. There are
several ways to compute these values (also called weights); in
this paper, we use tf-idf weighting. With this weighting, the VSM
is known as the term frequency-inverse document frequency model.
The weight vector for document $d$ is:


$$V_d = (w_{1,d}, w_{2,d}, \ldots, w_{N,d}) \tag{2}$$

where

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|} \tag{3}$$

Here $\mathrm{tf}_{t,d}$ is the term frequency of term $t$ in
document $d$ (a local parameter), and
$\log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$ is the inverse
document frequency (a global parameter). $|D|$ is the total number
of documents in the document set, and
$|\{d' \in D \mid t \in d'\}|$ is the number of documents
containing the term $t$.

After obtaining the vectors, the VSM similarity between two
documents can be calculated using the cosine similarity:


$$\mathrm{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} \tag{4}$$
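The tf-idf weighting of Eq. (3) and the cosine similarity of Eq. (4) can be sketched together as below. This is a minimal illustration under simple assumptions (whitespace tokenization, raw term counts, natural logarithm, sparse dictionaries as vectors); the function names and the toy sentences are ours, not the system's.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document, w = tf * log(|D| / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "earthquake hits city".split(),
    "earthquake damages city buildings".split(),
    "team wins football match".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared terms give a positive score
print(cosine(vecs[0], vecs[2]))  # no shared terms give 0.0
```

Note that a term occurring in every document receives weight zero under Eq. (3), so it contributes nothing to the similarity.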


The other method we used to calculate similarity is based on
mutual information preserving mapping (MIPF), a manifold learning
algorithm that computes low-dimensional, neighborhood-preserving
embeddings of high-dimensional inputs based on mutual information.
With a sufficiently large data set, we expect that each document
can be expressed through its neighbors' mutual information and
that each document and its neighbors lie on or close to a locally
linear patch of the manifold. Each text can then be reconstructed
from its neighbors based on information content. Reconstruction
errors are measured by the cost function

$$\sum_i \Big| I(X_i) - \sum_j W_{ij} \, I(X_i, X_j) \Big|^2 \tag{5}$$


Table 1: Experimental results.

Minimizing the reconstruction errors yields the weights W. Using
these weights, the low-dimensional vectors Y are computed with the
embedding cost function:

$$\sum_i \Big| I(Y_i) - \sum_j W_{ij} \, I(Y_i, Y_j) \Big|^2 \tag{6}$$

Unlike traditional manifold methods for images, MIPF is applied to
text, and its optimization uses the relationships between
different texts. At the same time, it retains the advantages of
the manifold method. By exploiting the local symmetries of the
reconstructions, MIPF is able to learn the global structure of
mutual-information-based manifolds, such as those generated by
text documents.

• Text Clustering

We chose k-means [7] clustering after many experiments. k-means is
a popular method for cluster analysis in data mining. It aims to
partition n observations into k clusters, in which each
observation belongs to the cluster with the nearest mean, which
serves as a prototype of the cluster.
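For reference, the k-means procedure described above (Lloyd's algorithm) can be sketched in plain Python. This is an illustrative version under stated assumptions: dense vectors as lists of floats, Euclidean distance, a fixed number of iterations, and the first k points as initial centroids. The paper does not report its initialization, distance measure, or value of k.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels, centers = kmeans(points, k=2)
print(labels)   # [0, 1, 0, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```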

• Sentence Selection

After text clustering, we obtain topic-based clusters for each
event. We choose the cluster centers and the top sentences as the
summary. In total, we choose about 100 sentences per event from
thousands of candidate sentences. The last step is to rank these
central sentences. Time and similarity are the two factors used to
rank the summary sentences. After this step, the final temporal
summary is obtained.
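One possible reading of the selection and ranking step is sketched below. The paper does not specify how the time and similarity factors are combined, so the ordering used here (earlier sentences first, ties broken by higher similarity) is an assumption, as are the dictionary fields and the function name `select_sentences`.

```python
def select_sentences(candidates, per_cluster=1):
    """Pick the sentences closest to each cluster center, then rank them.

    candidates: dicts with 'id', 'cluster', 'time', and 'sim'
    (similarity of the sentence to its cluster center).
    """
    by_cluster = {}
    for cand in candidates:
        by_cluster.setdefault(cand["cluster"], []).append(cand)
    selected = []
    for members in by_cluster.values():
        # Keep the sentences most similar to the cluster center.
        members.sort(key=lambda c: c["sim"], reverse=True)
        selected.extend(members[:per_cluster])
    # Assumed ranking: earlier sentences first, ties broken by higher similarity.
    selected.sort(key=lambda c: (c["time"], -c["sim"]))
    return [c["id"] for c in selected]


candidates = [
    {"id": "a", "cluster": 0, "time": 300, "sim": 0.9},
    {"id": "b", "cluster": 0, "time": 100, "sim": 0.4},
    {"id": "c", "cluster": 1, "time": 200, "sim": 0.7},
    {"id": "d", "cluster": 1, "time": 150, "sim": 0.6},
]
print(select_sentences(candidates))  # ['c', 'a']
```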


EXPERIMENT  RESULTS 

Evaluation 

According to the TREC track organizers, there are three metrics:

• Expected Gain. One way to evaluate an update system is to
measure the expected gain for a system update. This is similar to
traditional notions of precision in information retrieval
evaluation.

• Comprehensiveness. This is similar to traditional notions of
recall in information retrieval evaluation.

• F measure. In order to summarize expected gain and
comprehensiveness, we use an F measure based on both Expected Gain
and Comprehensiveness.
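As a worked illustration of how the two scores can be combined, the sketch below assumes the F measure is the harmonic mean of Expected Gain and Comprehensiveness. This is the conventional form of an F measure; the official track definition (for example, any normalization or latency discounting applied before combining) may differ and should be taken from the track guidelines.

```python
def f_measure(expected_gain, comprehensiveness):
    """Harmonic mean of the two scores (assumed form); 0.0 when both are zero."""
    total = expected_gain + comprehensiveness
    if total == 0:
        return 0.0
    return 2 * expected_gain * comprehensiveness / total


print(f_measure(0.5, 0.5))  # 0.5
```

Because the harmonic mean is dominated by the smaller value, a run with high Expected Gain but low Comprehensiveness still receives a low F score.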

Results 

Table 1 shows the results of our system. In the first row of
Table 1, EG denotes the expected gain score, C the
comprehensiveness score, and F the F measure score. In the second
row, Q0 and Q1 are the runs we submitted, and AVG is the mean
score for each topic over all runs submitted to the track. In the
first column, the per-topic rows give the scores for each
individual topic, and the mean row gives the average of the scores
over the 15 topics for each run. In the second column, All denotes
the mean score over all topics and all runs submitted to the
track.

As Table 1 shows, Q0 and Q1 mostly outperform AVG on Expected Gain
and F measure, which indicates that our methods are effective.
However, for several topics the Comprehensiveness value is smaller
than AVG, which means that our methods perform less well in terms
of recall. Comparing the last three rows, we conclude that, except
for Comprehensiveness, our runs perform better than the average.


Conclusion 

In this paper, we presented the implementation details of our runs
for the Temporal Summarization Track. Our runs performed well with
respect to Expected Gain and F score, but less well with respect
to Comprehensiveness. A possible reason is that we placed
excessive emphasis on the relevance between topic and sentence and
neglected the comprehensiveness of the topic coverage. Therefore,
future work should focus on improving Comprehensiveness (or
recall).